RBiotools is a package for COMPARATIVE MICROBIAL GENOMICS. RBiotools works best in the RStudio integrated development environment. RBiotools works with IDs from GenBank, the NIH genetic sequence data base, to perform analyses and create figures.
Before installation,
1. install and update dependencies
#Install some basic dependencies that are required for RBiotools
install.packages(c("rlang","ape", "fmsb", "gplots", "grImport", "gridExtra", "msa", "pheatmap", "RCurl", "rentrez", "seqinr"), repos =" http://cran.us.r-project.org ")
install.packages(c("lattice","mgcv","nlme","survival"), repos =" http://cran.us.r-project.org ")
#Install msa and Biostrings package via BiocManager (and install BiocManager if it is not already installed)
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager", repos =" http://cran.us.r-project.org ")
BiocManager::install("msa")
BiocManager::install("Biostrings", force = TRUE)
#Install installr - needed to load RBiotools from tar.gz
install.packages("installr", repos =" http://cran.us.r-project.org ")
## this library call is required so the dendrogram will work
library(msa)
## this library call is required so we can install RBiotools from the zip source
library(installr)
2. install the RBiotools package from zip source; and load the RBiotools library
install.packages("C:/dc3/RBiotools_0.5.0.tar.gz", repos = NULL, type = "source")
library(RBiotools)
3. Setup the environment and working directory
Choose a directory / folder that will be used for some of the figures, including Heat Maps and Genome Atlases:
ifelse(!dir.exists(".//r_repository//"), dir.create(".//r_repository//"), "Folder exists already")
setwd("C:/dc3/R_repository") #setting the working directory
The first step is to choose a set of organisms that you would like to explore.
What is an organism?
Let’s use a sample sets of organisms to get started. Specifically, a list of E. coli and Shigella organisms. This set will be stored using the name eColi.
eColi <- c(
"AP012306", # Escherichia coli str. K-12 substr. MDS42 DNA 3.976 Mb - smallest chromosome
"KK583188", # Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 4.907 Mb - type strain scaffold
"U00096", # Escherichia coli str. K-12 substr. MG1655 4.642 Mb - first E. coli genome sequenced
"CP000802", # Escherichia coli HS 4.644 Mb - group A representative, commensal
"CP000800", # Escherichia coli E24377A 4.980 Mb - group B1 representative
"AP009378", # Escherichia coli SE15 4.717 Mb - group B2 representative, commensal
"FM180568", # Escherichia coli 0127:H6 E2348/69 4.966 Mb - group B2 representative, enteropathogenic
"CU928163", # Escherichia coli UMN026 5.202 Mb - group D representative
"CP008957", # Escherichia coli O157:H7 str. EDL933 5.547 Mp - group E representative
"CP027027", # Shigella dysenteriae strain E670/74 5.037 Mb - Shigella dysenteria representative
"CP026802", # Shigella sonnei strain ATCC 29930 4.975 Mb - Shigella sonnei representative
"CP026877", # Shigella boydii strain ATCC 35964 5.129 Mb - Shigella boydii representative
"CP026793", # Shigella flexneri strain 74-1170 4.734 Mb - Shigella flexneri representative
"CP015831" # Escherichia coli O157 strain 644-PT8 5.831 Mb - largest chromosome
)
and here we have a list of organisms that are not as closely related from the Proteobacteria classes. This set will be stored using the name proteobacteria.
proteobacteria <- c(
"CP018228", # Rhizobium leguminosarum strain Vaf-108 Phylum: Proteobacteria (alpha)
"CP017405", # Bordetella pertussis strain J448 Phylum: Proteobacteria (beta)
"CP008957", # Escherichia coli O157:H7 str. EDL933 Phylum: Proteobacteria (gamma)
"HG530068", # Pseudomonas aeruginosa PA38182 Phylum: Proteobacteria (gamma)
"CP002031", # Geobacter sulfurreducens KN400 Phylum: Proteobacteria (delta)
"CP002332" # Helicobacter pylori Gambia94/24 Phylum: Proteobacteria (epsilon)
)
Plot a dendrogram using 16S rRNA genes
#This first plot is for the eColi and Shigella organisms
dendrogram16S(eColi)
## Initializing RBiotools
## Downloading genome data for organism with accession ID: AP012306
## Downloading genome data for organism with accession ID: KK583188
## Downloading genome data for organism with accession ID: U00096
## Downloading genome data for organism with accession ID: CP000802
## Downloading genome data for organism with accession ID: CP000800
## Downloading genome data for organism with accession ID: AP009378
## Downloading genome data for organism with accession ID: FM180568
## Downloading genome data for organism with accession ID: CU928163
## Downloading genome data for organism with accession ID: CP008957
## Downloading genome data for organism with accession ID: CP027027
## Downloading genome data for organism with accession ID: CP026802
## Downloading genome data for organism with accession ID: CP026877
## Downloading genome data for organism with accession ID: CP026793
## Downloading genome data for organism with accession ID: CP015831
## Calling type 'SSU' rRNA genes for genome with accession ID: AP012306
## Calling type 'SSU' rRNA genes for genome with accession ID: KK583188
## Calling type 'SSU' rRNA genes for genome with accession ID: U00096
## Calling type 'SSU' rRNA genes for genome with accession ID: CP000802
## Calling type 'SSU' rRNA genes for genome with accession ID: CP000800
## Calling type 'SSU' rRNA genes for genome with accession ID: AP009378
## Calling type 'SSU' rRNA genes for genome with accession ID: FM180568
## Calling type 'SSU' rRNA genes for genome with accession ID: CU928163
## Calling type 'SSU' rRNA genes for genome with accession ID: CP008957
## Calling type 'SSU' rRNA genes for genome with accession ID: CP027027
## Calling type 'SSU' rRNA genes for genome with accession ID: CP026802
## Calling type 'SSU' rRNA genes for genome with accession ID: CP026877
## Calling type 'SSU' rRNA genes for genome with accession ID: CP026793
## Calling type 'SSU' rRNA genes for genome with accession ID: CP015831
## Creating dendrogram for 14 16S rRNA sequences
## use default substitution matrix
#The second plot is for the proteobacteria organisms
dendrogram16S(proteobacteria)
## Downloading genome data for organism with accession ID: CP018228
## Downloading genome data for organism with accession ID: CP017405
## Downloading genome data for organism with accession ID: HG530068
## Downloading genome data for organism with accession ID: CP002031
## Downloading genome data for organism with accession ID: CP002332
## Calling type 'SSU' rRNA genes for genome with accession ID: CP018228
## Calling type 'SSU' rRNA genes for genome with accession ID: CP017405
## Calling type 'SSU' rRNA genes for genome with accession ID: HG530068
## Calling type 'SSU' rRNA genes for genome with accession ID: CP002031
## Calling type 'SSU' rRNA genes for genome with accession ID: CP002332
## Creating dendrogram for 6 16S rRNA sequences
## use default substitution matrix
createAtlas(eColi[1]) # here we index into the list and grab the first item; AP012306 or E.coli str. K-12 substr. MDS42 DNA
Genome Atlas for eColi group
Note: To view this Genome Atlas, which is stored as an SVG file
* Open a browser ... Safari, Chrome, Firefox, ...
* Click "File" in the menu bar and choose "Open File ..." from the pull-down menu
* Navigate to your Working Directory
* Double-click on "GenomeAtlas_AP012306.svg"
Note: The plotUsage function has many options. See the plotUsage documentation.
plotUsage(eColi[1])
## ----------
## Predicting protein genes for genome with accession ID: AP012306
## Request: Single Genome, Phase: Training
## Reading in the sequence(s) to train ... 3976195 bp sequence created, 51.14% GC
## Looking for GC bias in different frames ... frame bias scores: 1.56 0.182 1.25
## Building initial set of genes to train from ...
## done!
## Creating coding model and scoring nodes ... done!
## Examining upstream regions and training starts ...
## done!
## Request: Single Genome, Phase: Gene Finding
## Finding genes in sequence # 1 (3976195 bp) ...
## done! 3621 genes predicted
## Plotting AA usage across 3621 proteins
Note: Heat Maps work best for groups of organisms that are NOT closely related (like our Proteobacteria group?)
plotHeatMapCodon(proteobacteria)
## Calling protein genes for accession ID: CP018228
## Calling protein genes for accession ID: CP017405
## Calling protein genes for accession ID: CP008957
## Calling protein genes for accession ID: HG530068
## Calling protein genes for accession ID: CP002031
## Calling protein genes for accession ID: CP002332
Heat Map Codon for Proteobacteria group
Note: The heat map is now in your Working Directory.
* Click on the "Files" tab in the lower left quadrant
* Navigate to your Working Directory (hint: you may need to click on the name of your Working Directory to update it)
* Click on the "HeatMapCodon.png" file
Note: Blast Matrices work best for groups of closely related organisms (like our eColi group?)
proteinGrouping <- runLinclust(eColi)
## Calling protein genes for accession ID: KK583188
## Calling protein genes for accession ID: U00096
## Calling protein genes for accession ID: CP000802
## Calling protein genes for accession ID: CP000800
## Calling protein genes for accession ID: AP009378
## Calling protein genes for accession ID: FM180568
## Calling protein genes for accession ID: CU928163
## Calling protein genes for accession ID: CP027027
## Calling protein genes for accession ID: CP026802
## Calling protein genes for accession ID: CP026877
## Calling protein genes for accession ID: CP026793
## Calling protein genes for accession ID: CP015831
## Number of ungapped alignments: 112023
## Number of Smith-Waterman alignments: 2514
plotBlastMatrix(proteinGrouping)
Blast MAtrix for eColi group
Note: To view this Homology Matrix, which is stored as an SVG file
* Open a browser ... Safari, Chrome, Firefox, ...
* Click "File" in the menu bar and choose "Open File ..." from the pull-down menu
* Navigate to your Working Directory
* Double-click on "BlastMatrix.svg"